BayesKAT: bayesian optimal kernel-based test for genetic association studies reveals joint genetic effects in complex diseases

Abstract Genome-wide Association Studies (GWAS) methods have identified individual single-nucleotide polymorphisms (SNPs) significantly associated with specific phenotypes. Nonetheless, many complex diseases are polygenic and are controlled by multiple genetic variants that are usually non-linearly dependent. These genetic variants are marginally less effective and remain undetected in GWAS analysis. Kernel-based tests (KBT), which evaluate the joint effect of a group of genetic variants, are therefore critical for complex disease analysis. However, choosing different kernel functions in KBT can significantly influence the type I error control and power, and selecting the optimal kernel remains a statistically challenging task. A few existing methods suffer from inflated type 1 errors, limited scalability, inferior power or issues of ambiguous conclusions. Here, we present a new Bayesian framework, BayesKAT (https://github.com/wangjr03/BayesKAT), which overcomes these kernel specification issues by selecting the optimal composite kernel adaptively from the data while testing genetic associations simultaneously. Furthermore, BayesKAT implements a scalable computational strategy to boost its applicability, especially for high-dimensional cases where other methods become less effective. Based on a series of performance comparisons using both simulated and real large-scale genetics data, BayesKAT outperforms the available methods in detecting complex group-level associations and controlling type I errors simultaneously. Applied on a variety of groups of functionally related genetic variants based on biological pathways, co-expression gene modules and protein complexes, BayesKAT deciphers the complex genetic basis and provides mechanistic insights into human diseases.


INTRODUCTION
Deciphering the genetic basis of complex traits, such as the Alzheimer's disease, plays pivotal roles in functional genomics and precision medicine [1,2].Based on the advancement in highthroughput sequencing techniques, specific associated genetic variants, e.g.single-nucleotide polymorphisms (SNPs), have been identified for a large panel of phenotypes using Genomewide Association Studies (GWAS) [3].However, traditional GWAS approaches treat SNPs independently and can only discover individual SNPs that have strong marginal statistical associations with the phenotype of interest.It is well documented that many complex diseases and phenotypes are often associated with multiple genetic variants [4][5][6][7] where an individual variant itself might be weakly associated with the phenotype.In contrast, groups of such SNPs may jointly contribute to the phenotype, potentially mediated via their cooperative participation in important biological processes or pathways [8,9].Therefore, the traditional GWAS framework of testing individual SNPs separately without considering the correlation structures and the potential interactions among SNPs may not capture the group-wise joint SNP effects.Separate testings of SNP associations by traditional GWAS approaches are also limited to reveal the underlying biological mechanisms of complex phenotypes.Alternative approaches based on multivariate regression significantly suffer from the large degrees of freedom in genome-wide association tests and can substantially lose the statistical power [10].
To overcome this critical challenge, kernel-based testing (KBT) framework has been introduced to test group-wise joint SNP effects [10][11][12][13][14][15][16].By incorporating a kernel function to measure the similarity among genetic variants and compare with the phenotype similarities, the KBT framework simultaneously models the joint effects of multiple genetic variants.Wu et al. [12] first proposed the widely used sequence kernel association test (SKAT) model to test rare-variant associations.As a supervised, versatile and computationally streamlined regression approach, SKAT accesses the associations between genetic variants within a specific region and the trait.As the outputs from the SKAT model, P-values of the statistical associations are generated, facilitating straightforward interpretations of the findings.An R package [17] has been developed for implementing different kinds of KBT models, including SKAT.
To enable novel discoveries of the genetic basis underlying complex diseases, maximizing statistical power in genome-wide association tests while effectively controlling type 1 errors is strongly desired.Under the KBT framework, statistical power heavily depends on the specific choice of kernel functions [16,[18][19][20].However, the existing KBT models, including SKAT, require the kernel function to be specified a priori.Because the true functional relationship between the genetic variants and phenotypes is usually unknown in practice, selecting the ideal kernel function in advance for the KBT model, one that maximizes statistical power without increasing the type 1 error rate, poses statistical and computational challenges.One common approach that has been used is to repeat the KBT procedures based on different choices of kernels and then select the one resulting in the minimum P-value, which has been discussed by multiple studies [18,21].The major problem of this straightforward approach is the inf lated type 1 error.Although data-dependent permutation or perturbation methods [18] can help ease the problem, they are not computationally scalable, especially when applied to highdimensional datasets in large-scale genomic studies.An alternative approach is to use an equal-weighted average of multiple candidate kernels to form an averaged composite kernel [18], which performs better than the worst-performing candidate kernel but does not usually achieve the performance of the optimal kernel function.Tests based on the average kernel approach may lead to inconsistent or incorrect conclusions in applications, as we will demonstrate below.He et al. [21] proposed a maximum kernel test model based on the U statistic, i.e. the mKU model, which claims to achieve the statistical power as close as to the best candidate kernel in high-dimensional settings under certain distributional assumptions.However, the specific distributional assumptions may not hold in practice and hence lead to inf lated P-values, which will also be discussed in this study.
To further illustrate the significance and difficulty of choosing appropriate kernels in genetic association testings, Figure 1A demonstrates an example based on the genotype data for the trait of whole brain volume collected from the ADNI project for the Alzheimer's Disease Neuroimaging Initiative [Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu)].As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report.A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf(https://adni.loni.usc.edu/) [22].The group of genetic variants located in genes belonging to the caffeine metabolism pathway [23][24][25] are included into the KBT model to test the hypothesis that whether the caffeine metabolism pathway is associated with the whole brain volume phenotype.Using different kernel functions, the SKAT model leads to inconsistent conclusions.For instance, based on Quadratic kernel, the SKAT model rejects the null hypothesis (P-value<0.05),while the use of Gaussian kernel or IBS kernel does not lead to any rejection of the null hypothesis Figure 1A.On the other hand, using the equal-weighted average composite kernel, the SKAT model tends to reject the null hypothesis (P-value= 0.047).Since there is no clear mechanistic link between the caffeine metabolism pathway and whole brain volume, the rejection of the null hypothesis based on the Quadratic and the average composite kernels is likely a false discovery.To further quantify this issue, in Figure 1B, a cohort of total 500 replicate synthetic datasets based on the same covariates and genotype data is created, where the phenotype variables are generated by a Quadratic function h(•).Applying the SKAT test based on different kernel functions on the synthetic datasets, inconsistent testing results appear to be a persistent problem.Although the Quadratic kernel function leads to the correct hypothesis testing result as expected, the average composite kernel usually leads to incorrect conclusions (Figure 1B).As shown in the barplot of Figure 1B, using different kernels (Linear, Quadratic, Gaussian and average composite kernels) across the 500 synthetic datasets, inconsistent results are observed and the overall fractions of correct testing results are very low.Hence, the inconsistent conclusions based on different kernels suggest the fundamental need of developing a systematic data-adaptive approach of selecting appropriate kernel functions for KBT models in genome-wide association tests.
In this study, we developed a novel Bayesian Kernel-Based Association Testing algorithm, BayesKAT (https://github.com/wangjr03/BayesKAT).This algorithm effectively tackles the kernel selection challenge by choosing the optimal kernel in a dataadaptive way and calculating the posterior probability of association by evaluating the joint statistical associations of specific SNP groups with a complex phenotype.Moreover, compared with existing KBT-based methods, BayesKAT simultaneously achieves four goals in genome-wide association tests: (i) superior statistical power, by selecting the optimal kernel function based on the dataset under study; (ii) consistent results, by avoiding repeated tests based on a variety of different kernels; (iii) controlled type-1 error, without relying on unverified distributional assumptions or minimum P-value kernels and (iv) strong computational scalability for high-dimensional and large-sample genomewide data.Two alternative computational strategies, i.e.MCMC and MAP, are incorporated in BayesKAT, leading to additional implementation f lexibilities for users.Extensively tested on a series of simulated datasets under different parameter settings, BayesKAT consistently demonstrates superior performance against existing methods.Furthermore, applied on the ADNI genotype datasets of the complex trait of whole brain volume (https://adni.loni.usc.edu),BayesKAT successfully discovered mechanistically related genes and biological pathways with higher accuracy.Specific genes and pathways related with neurodegenerative diseases, as reported by previous studies, are consistently prioritized by BayesKAT while not prioritized by other methods.Strikingly, BayesKAT is able to identify grouplevel SNP effects of novel co-expressed gene modules and protein complexes that potentially participate in the molecular processes modulating the whole brain volume phenotype.These algorithmic advantages and new biological discoveries robustly support the statistical innovation of BayesKAT and strongly highlight its effectiveness in decoding the genetic basis and associated molecular mechanisms underlying complex diseases.

Overview of KBT models for genetic data
Under the kernel machine regression framework, continuous quantitative traits can be associated with genetic variants or molecular features, along with additional covariates, through a semiparametric model: where Y i denotes the continuous value of the trait for the ith person in a sample of size n; ] is a set of k covariates for the ith individual that need to be controlled and is the vector for the p genetic variants or molecular features, where Z ij denotes the jth genetic variant or molecular level feature for the ith individual.The unknown errors i are assumed to be independent and follow N(0, σ 2 ), where the value σ 2 is also unknown.It has been shown [11] that the kernel machine regression model in equation ( 1) is equivalent to the following linear mixed model: where β ∈ R k is a vector of effect sizes for covariates X ∈ R n×k , h is an n × 1 vector of random effects which is distributed as h ∼ N(0, τ K), where K is the n × n kernel matrix and the error is distributed as ∼ N(0, σ 2 I), where σ 2 is the error variance and τ is a variance component for the genetic effect.The main goal is to test if the genetic variants have any combined effect on the outcome variable Y. Testing for the presence of group effect of Z is equivalent to testing the hypothesis H 0 : τ = 0 versus H 1 : τ > 0. Choosing an appropriate kernel is crucial and that topic has been discussed further in the supplementary files.

BayesKAT
Our new algorithm, BayesKAT (https://github.com/wangjr03/BayesKAT), employs a novel Bayesian modeling strategy to automatically select the optimal composite kernel based on the data and does not require the composite kernel function to be set a priori by the user.Based on the inferred optimal composite kernel function, BayesKAT can efficiently test the joint effects induced by a group of genetic or molecular features associated with a phenotype.The optimal composite kernel is a linear combination of candidate kernels where the weight of each candidate kernel ref lects the degree of usefulness of the kernel explaining the complex relationship between a group of features and the phenotype of interest.As an illustration, suppose there are three potential kernels: Quadratic, Gaussian and IBS.If the IBS kernel effectively captures the underlying relationship, it will carry greater significance within the composite kernel, hence have a larger weight, while the impact of other kernels may be relatively weak, as indicated by their lower weights.Figure 1C provides an overview of the workf low of BayesKAT and its two computational strategies: (1) the Markov Chain Monte Carlo (MCMC) strategy; and (2) the Maximum a posteriori (MAP) strategy.Additionally, it is noteworthy that while BayesKAT is primarily developed to test genetic associations, it can also be employed for a wide range applications, including testing the association between continuous gene expression features and complex traits.
Consider a set of m candidate kernels Therefore, selecting the optimal composite kernel is equivalent to selecting the optimal value for the weight so that it can capture the underlying relationship between the genetic or molecular features and the trait, when testing the group-level effect of a set of multiple features.
As a KBT model, BayesKAT relies on a set of candidate kernel functions, which are incorporated to infer the optimal composite kernel for the association tests.For the convenience of practical implementations, BayesKAT infers the optimal composite kernel consisting of three candidate kernels as the default setting.And the default candidate kernels include Quadratic, Gaussian and IBS kernel.To construct a composite kernel, the candidate kernels are normalized in BayesKAT based on the previously proposed technique [21] so that they are in the same scale and comparable.
To gain robust performance, weakly informative prior distributions are used for model parameters by default, although the users can incorporate more informative priors based on specific knowledge about the data.The important model parameters are θ = [ ρ, τ 1 , σ 2 , β], where ρ = [ ρ1 , ρ2 , ..., ρm ] are the unscaled weights of the candidate kernels ( And the weakly informative prior distributions are

The actual weights for candidate kernels
where the composite kernel

BayesKAT strategy
Here the main hypothesis to test is Bayes factor(BF 10 ) is calculated to test the hypothesis, which evaluates the evidence in favor of the alternative hypothesis.Bayes factor is defined as the ratio of marginal likelihoods under two hypotheses [26]: where P(Data|H 0 ) and P(Data|H 1 ) are the marginal likelihoods under H 0 and H 1 , respectively.Given the input data, BayesKAT mainly uses two efficient and easy-to-implement strategies to select the composite kernel and calculate the Bayes Factor BF 10 by estimating P(Data|H 0 ) and P(Data|H 1 ).The two computational strategies are explained in subsequent sections.

Interpreting BayesKAT output
The preliminary output from BayesKAT, BF 10 , is a summary of evidence provided by the data in favor of H 1 as opposed to H 0 .In addition, the posterior probability of the association (P(H 1 |Data)) is calculated as the final output, which has a one-one relation with BF 10 , i.e.

P(H
Here P(H 1 ) and P(H 0 ) are the prior probabilities under H 1 and H 0 , respectively, i.e. the probabilities of existence of association and no association, respectively.Depending on the specific biological problem and dataset, the values of P(H 0 ) and P(H 1 ) can be set given biological evidence or prior knowledge, and P(H 1 |Data) can be calculated.P(H 1 |Data) is the posterior probability of the model under H 1 given the data.P(H 1 |Data) gives a quantitative evaluation of how probable there exists an association between the genetic features and the phenotype of interest or how strong the evidence is against the null hypothesis.
As already demonstrated by the example in Figure 1A and 1B in the introduction section, existing methods based on pre-selected kernels or average composite kernels suffer from inconsistent hypothesis testing results and may lead to false discoveries.In contrast, for the Figure 1A scenario, BayesKAT calculated the posterior probability of a true association given the data P(H 1 |data)=0.19,i.e. the evidence of association is very low and the existence of true association is not very likely.This is consistent with the fact that there is a lack of documented evidence of the association between the caffeine metabolism pathway and the whole brain volume phenotype.Strikingly, in the scenario of the synthetic data cohort simulated with known association based on the Quadratic kernel (see Figure 1B), BayesKAT successfully calculated the posterior probability P(H 1 |Data) = 0.99 to suggest the existence of association, without incorporating any prior information.Moreover, tested on 500 repeated simulation cohorts, BayesKAT achieves much higher statistical power than other methods and also demonstrates more consistent testing results (see Figure 1B).Investigating the inferred weights for each candidate kernel functions in the final composite kernel selected by BayesKAT further shows that the Quadratic kernel is correctly assigned with the largest weights (Supplementary Figure S1), suggesting that BayesKAT can efficiently capture the functional form of the underlying statistical associations in a data adaptive way.

BayesKAT-MCMC
As a Bayesian model, BayesKAT-MCMC employs the MCMC sampling-based strategy to infer the optimal composite kernel function.Leveraging MCMC for efficient and traceable samplings from complex target distributions, BayesKAT-MCMC avoids direct sampling from the posterior distribution P(θ |Data) (see section 2.5), which does not have a closed mathematical form and is computationally intractable.Instead, Metropolis-Hastings method [27,28] is used to iteratively draw samples based on the generated Markov chain, which are able to approximate the target probability distribution P(θ |Data).
Let θ 0 denote the initial value for θ .The tth iteration of the Metropolis-Hastings algorithm consists of the following steps [29,30]: 1. Sample a candidate point θ t from a proposal distribution J t (θ * |θ t−1 ). 2. Calculate the acceptance ratio for jumping to the new point 3. Set θ t = θ * with probability r 1 and θ t = θ t−1 with probability 1 − r 1 .That is, it jumps to the new proposed value with probability r 1 and stays at the same value with probability 1 − r 1 .
Here is the main workf low for BayesKAT-MCMC: Input: Genotype matrix Z, covariate matrix X, response y.
Define the data distribution based on θ H i , X, Z and y, as defined in (3) where i =0 or 1 Step 1: Using the Metropolis-Hastings MCMC method mentioned above, draw samples from the posterior distribution of θ H1 using three separate MCMC chains, each of which ran 50 000 iterations.
Step 2: check if the algorithm has converged, otherwise run more iterations.
Step 3: Draw samples from the posterior distribution of θ H1 by similarly Repeating steps 1 and 2.
Step 4: Using the posterior samples of θ H1 , the unknown parameters are estimated: From ρ, the optimal kernel weights ρ can be estimated.
BayesKAT-MCMC uses the R package BayesianTools [32] to generate two sets of samples from the posterior distribution of θ under the hypotheses H 1 and H 0 , using the Metropolis-Hastings algorithm in an adaptive way [33] to leverage the history of the stochastic process and appropriately fine-tune the proposal distributions.Three separate MCMC chains initiated from different random start points are generated for 50 000 iterations.To ensure that the MCMC chains are converged, trace plot is used to visualize the moves of the Markov chains in the state space [34].In addition, based on the Gelman-Rubin diagnostic method [35], the potential scale reduction factor, i.e.PSRF score, is also calculated and presented at the end of MCMC sampling to inspect whether the chain is converged in which the PSRF score is close to one.
Based on the generated posterior samples, the marginal distributions of the parameters are further visualized as shown in Figure 1D.The posterior samples are used to estimate the composite kernel weights and also the marginal likelihoods P(Data|H i ), i = 1, 0 using the Chib's method [31].Bayes Factor is subsequently calculated using the formula presented in equation (4).

BayesKAT-MAP
Although BayesKAT-MCMC yields comprehensive information of the marginal distributions of model parameters, drawing large sets of samples from the posterior distributions in the MCMC strategy is computationally expensive.Here, we provide an alternative strategy, termed BayesKAT-MAP, which is easy to implement, allows parallel calculations and has higher computational scalability.Instead of drawing numerous MCMC samples from the posterior distributions and then estimating the parameter values, BayesKAT-MAP employs the quick optimization technique to estimate the parameters of interest directly, based on the MAP strategy such that where π(θ) denotes the prior distribution θ .Because the objective function is nondifferentiable at some points, a derivative-free optimization algorithm by Quadratic approximation using the R package Minqa [36] is implemented.The most important model parameter τ 1 indicates the existence of association between the feature set and the trait variable, with τ 1 = 0 suggesting that there is no association.In BayesKAT-MAP, if the calculated MAP estimator of τ 1 , i.e. τ1MAP = 0, it follows that the Bayes Factor =0 (i.e.no evidence of association is found), and the computational process terminates.On the other hand, if τ1MAP > 0, it implies that there might be some evidence of association and BayesKAT-MAP proceeds to calculate the MAP estimator again under H 0 and then computes the marginal likelihoods P(Data|H i ), (i = 1, 0), along with the Bayes Factor.Due to the practical limitations of exact analytical methods, such as relying on specific distributional assumptions, efficient numerical integration approaches [37] are needed to calculate the marginal likelihoods under hypotheses H i , P(Data|H i ) = Pr(Data|θ, H i ) π(θ|H i )dθ , so that the model can be applied on diverse panels of data.BayesKAT-MAP employs the Laplace's method [26,38,39] for approximating the integral and d is the dimension of θ , θ is the mode of the log-likelihood function l(θ |Data) and ˜ is the inverse of the negative Hessian matrix of the second derivative of l(θ |Data) computed at θ .For boundary regions of the parameter space, the Laplace approximation is modified according to the previously developed protocol [40] to accommodate the boundary cases.Based on the estimated marginal likelihood densities, the Bayes Factor is then computed and the posterior probabilities are finally inferred, given user-defined priors p(H 0 ) and p(H 1 ) for which non-informative priors are used as the default setting in BayesKAT.
Here is the summary of the workf low for BayesKAT-MAP: Input: Genotype matrix Z, covariate matrix X, response y.
Define the data distribution based on θ H i , X, Z and y as mentioned in (3), where i =0 or 1 Step 1: θH1 = θMAP is calculated using this formulation 5 using optimization technique.
Step 3: Calculate θH0 using same technique in step 1 under H 0 .
The performance and runtime of BayesKAT-MCMC and BayesKAT-MAP under different settings are systematically compared and further discussed in supplementary files.Because of the superior computational scalability of BayesKAT-MAP, as shown in 1D, the results in the paper are generated using BayesKAT-MAP.

Input data organization and model setup for BayesKAT
Data containing the information of genotypes or molecular features for complex phenotypes can be collected from large public-accessible or user-generated cohorts (e.g.ADNI, UK biobank, All of Us (https://allofus.nih.gov/),GTEx, PsychENCODE (https://psychencode.synapse.org/),etc.).Individual-level data can be pre-processed and efficiently undergo steps of quality controls using software such as Plink [41].Additionally, biological meaningful feature groups need to be defined and created depending on the goals of genome-wide association tests.In this study, we have explored four different biology inspired ways of grouping functionally related genetic variants, including (1) genewise groups: aggregating SNPs located within genomic regions of genes; (2) pathway-level groups: aggregating SNPs situated within genes that belong to a specific molecular pathway; (3) coexpression gene modules: aggregating SNPs located within genes that belong to a specific co-expression module and (4) Proteinprotein interaction(PPI) modules: aggregating SNPs located within genes that belong to a specific PPI module, which may represent a protein complex.

Comparison with other methods and performance evaluation
The performance of BayesKAT is compared with two state-ofthe-art algorithms: (1) SKAT using the average composite kernel, denoted as SKAT(Avg) [18]; and (2) the U statistic-based method, denoted as mKU [21].Both of these two methods are frequentist approaches that maximize the power after restricting the type 1 error to a fixed level of α, such as setting α to 0.05, and have been shown to outperform other existing methods.In contrast, as the first Bayesian model for this problem, BayesKAT uses a fixed threshold on the posterior probability or the Bayes factor, based on previously suggested guidelines [26], to reject the null hypothesis.
To make fair comparisons, the performance of each method (i.e. the empirical statistical power) is evaluated at a fixed and equal empirical type 1 error across all three algorithms.A systematic comparison based on rigorous simulations is presented in the Results section.As multiple groups are simultaneously tested, a multiplicity correction technique is implemented, which is discussed in detail in the supplementary files.

SIMULATION STUDIES: ENHANCED EFFICACY OF BAYESKAT BENCHMARKED ON SIMULATION STUDIES
As defined in the Materials and Methods section, the rows of the feature matrix Z ∈ R n×p correspond to n individuals and the columns correspond to the p features.Depending on the particular genetic association studies, the features may encompass discrete genetic characteristics, like alleles with values of 0, 1 or 2, or continuous molecular features, such as gene expressions.We have conducted simulations for both cases, with Z corresponding to discrete or continuous features, under both low-and highdimensional settings.

Simulation with continuous features
As shown in Figure 2A, the simulations based on continuous features are first conducted to evaluate the performance of BayesKAT using similar scenarios presented in [21].With the specified parameters (n = 500, p = 100, k = 2, m = 3), the feature matrix Z is simulated from a multivariate normal distribution with mean 0 p and an AR(1) correlation matrix R where (R(j, j ) = r |j−j | ).In the simulation, the correlation r is set to be 0.6.The covariate matrix X ∈ R n×2 has one binary covariate generated from a Bernoulli (0.6) distribution and one continuous covariate generated from N(2,1).The covariate coefficients are set as β = (0.03, 0.5) and the outcomes Y are simulated based on model (1).The error term with variance σ 2 = 1 is added as the random noise.Three commonly used candidate kernels are incorporated in BayesKAT: Linear, Quadratic and Gaussian.Different functional forms of h(•) are used to create different scenarios to test the performance.The empirical type 1 errors of each method are calculated based on the specific scenario where h(Z) = 0, i.e. there is no association between Z and Y in the simulated data.The different scenarios are given below: In all the simulation scenarios, as shown in Figure 2B, the empirical power versus empirical type 1 error for BayesKAT is consistently better than that of SKAT(Avg) and mkU.Moreover, sensitivity analyses are conducted to evaluate the effects with different simulation parameters on the final performance of different algorithms.Notably, when the number of features p increases from 100 to 150, while keeping all other parameters fixed, BayesKAT consistently achieves superior empirical power compared with other methods (see Figure 2B).Similarly, when the correlation r increases from 0.6 to 0.8 with p fixed as 150, BayesKAT consistently demonstrates higher empirical power and outperforms other methods.Taken together, these simulation results demonstrate the robust superior performance of BayesKAT compared with SKAT(Avg) and mKU in various settings with continuous features.

Simulation with discrete features
The performance of different models on discrete SNP features is first evaluated based on simulations where randomly selected SNP groups are used as features.Because randomly selected SNPs are generally not functionally related, the overall effectiveness of all KBT models decreases as expected, with BayesKAT still showing improved empirical power compared with other methods (Supplementary Figure S2).Because the real-world implementations of KBT models for genetics studies usually focus on functionally related groups of SNPs, a more realistic strategy of simulating groups of discrete SNP features, instead of randomly selected unrelated SNPs, is employed to further benchmark the performance of BayesKAT (Figure 3A).
To rigorously capture the underlying linkage disequilibrium structures of discrete features in real-world SNP data, the ADNI dataset is used as the basis for the simulations, where the covariate matrix X is created from the real covariates (e.g.age, gender and education) of the corresponding individuals in the dataset, and SNPs located in specific genes belonging to the selected KEGG pathways are included into the testing (see Materials and Methods).Three different KEGG pathways are randomly selected for performance evaluations.For each pathway, the groups of SNPs located in the corresponding gene members are identified.The 'SNPknock' package [42] is then used to simulate the knockoff SNP data, which maintains the structural dependency among SNPs in each group intact in the simulated knockoff SNPs (Figure 3A).The feature matrix Z is constructed based on the simulated SNP data, with n = 755 and p ranging between 4000 and 5000.The outcome variable Y is simulated based on three different scenarios.Each scenario corresponds to a different functional form h(Z), that is, where Z i is the ith column of Z corresponding to the ith SNP.
For each group of pathway-level knockoff SNPs, 500 simulations are generated.By applying BayesKAT and the other methods on the set of simulations, the corresponding empirical type 1 error and empirical power are calculated accordingly.The summary of performance comparisons based on this extensive set of simulations is shown in Supplementary Figure S3.The empirical power and empirical type 1 error for each SNP group clearly demonstrates that BayesKAT robustly outperforms SKAT(Avg) and mKU, across different simulation scenarios and settings.The averaged performance over three pathway-level SNP groups can be found in Figure 3B.To demonstrate the robust superior performance of BayesKAT, the simulation study is repeated using different sample sizes, n=1000,1500.The comparison outcomes are illustrated in supplementary Figures S4 and S5.Remarkably, in addition to these consistent advantages, BayesKAT also achieves much lower empirical type 1 error across all simulation settings, when the suggested Bayes Factor threshold [26] is employed.It suggests that the associations between the SNP groups and the phenotype detected by BayesKAT exhibit a significantly higher level of reliability compared with other methods, a crucial attribute for genetic applications in complex diseases.

APPLICATION TO ADNI DATASETS:: BAYESKAT REVEALS NOVEL ASSOCIATED GENETIC BASIS OF COMPLEX TRAITS
To illustrate the novel biological insights generated by BayesKAT, the individual-level data, including the genotype, phenotype and demographic covariates, from the ADNI project (https://adni.loni.usc.edu) [22,43] are used to conduct a series of group-level genetic association testings.Specifically, BayesKAT is used to test the group-wise associations between SNP sets and the complex phenotype of whole brain volume, based on the available information across 755 individuals in the ADNI cohort.

Real data preprocessing
In addition to a series of simulated datasets, real genetic datasets are used to evaluate the performance of BayesKAT and its derived biological discoveries.The data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu).The ADNI was launched in 2003 as a public-private partnership led by Principal Investigator Michael W. Weiner, MD.The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers and clinical and neuropsychological assessments can be combined to measure the progression of mild cognitive impairment and early Alzheimer's disease.Plink software [41] (http://pngu.mgh.harvard.edu/purcell/plink/) is used to pre-process the individuallevel genotype data.Detailed information on data access, download and pre-processing steps can be found in the supplementary material.Four biology-based strategies are used to create the feature groups of functionally related SNPs, leading to complementary biological insights into the genetic basis underlying the specific phenotype.The four SNP feature grouping strategies are: (1) Genewise SNP groups for 18 999 protein-coding genes in the human genome [44].For each protein-coding gene, all SNPs located within +/-5KB of the gene body are collected as the gene-wise SNP feature set; (2) Pathway-wise SNP groups for 352 KEGG pathways [23][24][25].For each pathway, gene-wise SNP groups for all genes belonging to the specific pathway are collected as the pathwaywise SNP feature set.The number of SNPs per pathway varies from 35 to 22 555, with a mean of 1721 SNPs. Figure 5A provides a schematic figure demonstrating the pathway-wise joint SNP testing procedure; (3) Co-expression gene module based SNP groups.A total of 41 co-expression gene modules are identified using the R package 'WGCNA' [45,46] to find correlated gene co-expression clusters from expression data (available on the ADNI website).The (B) Performance comparison across different simulation settings with systematic performance evaluations, i.e. the empirical power versus empirical type 1 error, for SKAT(Avg), mKU and BayesKAT across different scenarios and parameter settings.When increasing p(no of features) while keeping other factors constant, all methods exhibit a slight decline in power, but BayesKAT consistently outperforms SKAT(Avg) and mKU.Additionally, as r(correlation coefficient between features) increases under other conditions fixed, BayesKAT also consistently surpasses SKAT(Avg) and mKU.co-expression gene modules have a different number of genes in them, which varies from 6 to 3361.SNPs within each module are extracted for further group-wise testing and (4) 401 PPI gene module based SNP groups.The PPI gene modules are previously created in [ 47] based on the topology of the PPI network.The number of genes in PPI modules ranges from 2 to 497.

BayesKAT prioritizes functionally related genes
To identify genes associated with the trait of whole brain volume, we conduct a gene-wise association test with gene-level SNPs (see Materials and Methods).BayesKAT prioritized 17 genes, whose posterior probability of association (P(H 1 |Data)) is greater than 0.7. Figure 4A shows some examples of the prioritized genes, which have been suggested to be associated with brain or neurodegenerative disorders by previous seminal studies [48][49][50].The whole list of the 17 prioritized genes and their corresponding posterior probability of associations are provided in Supplementary Table S2.Strikingly, as external evidence in support of BayesKAT's prioritized genes, 12 out of the 17 genes (71%) contain CpGs that have been found to be involved with significant meQTLs [51] in the human brain cortex.In contrast, the genes prioritized by SKAT(Avg) and mKU demonstrate much lower fractions of overlapping with CpGs linked to meQTLs (66% and 61% respectively, Figure 4B).The higher proportion of prioritized genes containing CpGs offers molecular-level support for BayesKAT's ability to uncover genes mechanistically linked to complex traits.

Biological pathways linked to neurodegenerative diseases are top-ranked by BayesKAT
To identify biological pathways that potentially modulate the whole brain volume trait, BayesKAT is used to analyze pathway-level SNP groups (see Materials and Methods, Figure 5A).The top 50 ranked KEGG pathways by each model are summarized in Figure 5B, where BayesKAT ranks the pathways based on decreasing posterior probability of association (P(H 1 |Data)), while the mKU and SKAT(Avg) methods rank the pathways based on decreasing -log 10 (P-value).Interestingly, BayesKAT successfully prioritized most of the neurodegenerative disease related pathways with top ranks (Figure 5B).This is a strong mechanistic support to BayesKAT's results, because the neurodegenerative diseases, including Alzheimer's disease, Huntington's disease, Amyotrophic lateral sclerosis and Parkinson's disease, have been found to be strongly related to brain volume loss [52][53][54][55].Based on a reasonable posterior probability threshold 0.7, there are 21 pathways identified by BayesKAT (Figure 5B, Supplementary Table

S3
), among which five pathways are associated with neurodegenerative diseases.In comparison, the mKU and SKAT(Avg) models prioritized large numbers of pathways (242 and 72, respectively), while only a small fraction of them are neurodegenerative disease-related pathways (6 and 5, respectively).For SKAT(Avg), these functionally related pathways are not even the topranked ones.As summarized in the corresponding pie charts in Figure 5B, BayesKAT achieves the highest efficiency in prioritizing important pathways and resulting in fewer false discoveries than the other methods.These results are also consistent with the larger type 1 errors of mKU and SKAT(Avg) observed from simulation analyses described above.Note that setting the posterior probability threshold is subjective, same as selecting type 1 error threshold or FDR threshold (0.05 or 0.01 or 0.1).Opting for a threshold greater than 0.7 (such as 0.9, 0.95 or 0.99) is also effective, as BayesKAT efficiently prioritizes the top pathways.Overall, the highly prioritized functional relevant pathways imply the novel biological insights that can be generated using BayesKAT.
To further evaluate the performance of BayesKAT in determining the optimal kernel weights, 10 randomly chosen pathways are used to test the pathway-level associations.The resulting composite kernels are compared with the results of using SKAT based on individual kernels separately.The corresponding -log 10 (P-values) metrics from SKAT using individual kernels are compared with the inferred kernel weights in the composite kernels from BayesKAT.As shown in Figure 6A, the high similarity between the two heatmaps indicates that BayesKAT can efficiently select the optimal composite kernel automatically from the data, without relying on prior knowledge or repetitively trying different individual kernels.

BayesKAT identifies trait-associated gene modules and protein complexes
To further demonstrate BayesKAT's capability of revealing novel group-level associations to traits from specific sets of cooperative SNPs, two additional SNP grouping strategies are applied: (1) SNPs from co-expression gene modules; and (2) SNPs from PPI modules (see Materials and Methods).Applied on the SNP groups aggregated from co-expression gene modules, BayesKAT is able to pinpoint specific modules as significantly associated with the whole brain volume trait (Figure 6B Left).On the other hand, SKAT(Avg) and mKU identify a large number of modules (Figure 6B Right), which are consistent with the inf lated type 1 errors of these two methods as observed previously, and failed to provide specific prioritizations of the gene modules.Remarkably, comparing with the significant GWAS SNPs identified from another genome-wide meta-analysis of brain volume study [56], two out of the four selected modules by BayesKAT (50%) contain significant GWAS SNPs (Figure 6C).In contrast, only 26% (7/27) and 43% (6/14) modules selected by mKU and SKAT(Avg) contain significant GWAS SNPs.These results further provide orthogonal support for the superior performance of BayesKAT in identifying new collective associations to the complex traits for SNP groups of functionally related genes.
When applied to SNP groups organized according to PPI modules that largely represent protein complexes, BayesKAT distinctly prioritizes eight PPI modules as significant, six of which contain significant GWAS SNPs (75%, Figure 6D, Figure 6E Left).On the contrary, only 30%(17/56) and 65%(13/20) of the modules selected by mKU and SKAT(Avg), respectively, contain significant GWAS SNPs (Figure 6D, Figure 6E Right).Taken together, the highly specific prioritizations of potential gene modules and protein complexes, along with the substantially improved justification from other GWAS SNPs, suggest that BayesKAT can facilitate novel discoveries of molecular components involved in complex traits and may pave the way for innovative approaches to disease treatments.

DISCUSSION
BayesKAT (https://github.com/wangjr03/BayesKAT) is a dataadaptive methodology that automatically selects the appropriate composite kernel using the MCMC algorithm (BayesKAT-MCMC) or the optimization technique (BayesKAT-MAP) and conducts hypothesis testing on the presence of group-level genetic associations for complex traits.The Bayesian framework and the inferred posterior probabilities are more interpretable and informative, compared with P-values from frequentist methods.Based on extensive benchmark analyses, BayesKAT demonstrates consistent superior performance than other methods across different settings.Moreover, evaluated on a series of biologically inspired SNP groups based on a real genetic dataset, BayesKAT not only achieves improved prioritization of functionally relevant and justified group-level SNP associations, but also enables novel discoveries with respect to the underlying molecular mechanisms of complex traits.By revealing the collective effects of functionally cooperative SNPs without relying on the prior knowledge of specific kernels, BayesKAT represents one important step forward toward the goal of deciphering the intricate genetic basis of human diseases.
Although some methods based on the Gaussian process [57] or supervised learning technique [58] attempt to select the best kernel using training data for prediction purposes, BayesKAT is the first Bayesian KBT methodology that simultaneously selects the optimal composite kernel while testing for the associations, without requiring the training data.In addition, the dataadaptive strategy of composite kernel selections also facilitates the description of more complicated interdependence structures that cannot be fully captured by individual kernels.Furthermore, BayesKAT provides the f lexibility of incorporating multiple testing corrections, integrating prior biological knowledge, and modeling various data types.To complement the MCMC strategy, BayesKAT-MAP is highly scalable and can be efficiently implemented for large-scale genome-wide studies.
BayesKAT utilizes the Metropolis-Hastings MCMC algorithm in combination with a derivative-free grid-search-based optimization approach to choose the composite kernel for specific datasets.Nevertheless, the BayesKAT framework is not restricted to these techniques.Other efficient MCMC sampling algorithms or reliable optimization techniques can be incorporated.A variety of sampling techniques have been reviewed and compared for different purposes of modeling [59][60][61][62].Integrating these techniques, especially the variational Bayes techniques, into the BayesKAT framework and systematically evaluating their performance for different types of applications will be an important step for future developments (More discussion on this topic is included in the supplementary files).

Key Points
• Bayesian model to automatically select the optimal composite kernels in a data-adaptive way.• BayesKAT consistently boosts the performance of genome-wide genetic association tests for complex diseases.• BayesKAT captures the group-level joint effects of cooperative SNPs.• Implementation of BayesKAT leads to new discoveries of disease-associated SNPs and the underlying molecular mechanisms.• BayesKAT facilitates the mechanistic insights (e.g.genes, pathways, gene modules and protein complexes) into complex phenotypes.

Figure 1 .
Figure 1.Overview of the significance and model design for BayesKAT.(A) A real-world example of an association test with different pre-specified individual kernels(IBS, Gaussian or Quadratic) or the simple average of individual kernels (the Average kernel) leads to inconsistent results.To test the association between the caffeine metabolism pathway and the phenotype of whole brain volume, choosing different kernel functions (or the Average kernel) yield disparate P-values, introducing ambiguity in the final conclusion.In contrast, BayesKAT offers a more interpretable metric (the posterior probability of association) to draw the final conclusion.(B) In a synthetic data cohort simulated assuming a true quadratic function, different kernel functions lead to inconsistent results.Across 500 simulation replicates, each kernel, including the quadratic kernel, yields inconsistent and ambiguous conclusions (barplot).In comparison, BayesKAT generates both interpretable and highly consistent results, with substantially boosted power.(C) Workf low of BayesKAT implementation for diverse types of genetic association tests to derive biological meaningful interpretations.(D) Model structures and the inference algorithms for the two BayesKAT strategies: BayesKAT-MCMC (left) and BayesKAT-MAP (right).BayesKAT-MCMC samples from posterior parameter distributions, providing a comprehensive view of the posterior parameter distributions.On the other hand, BayesKAT-MAP provides a more scalable solution, particularly well-suited for high-dimensional data.

Figure 2 .
Figure 2. Performance comparison based on simulations using continuous features.(A) Schematic summary of the data generation process for continuous molecular features or variables (e.g.gene expression features) and the demonstration of the implementation under various scenarios.(B) Performance comparison across different simulation settings with systematic performance evaluations, i.e. the empirical power versus empirical type 1 error, for SKAT(Avg), mKU and BayesKAT across different scenarios and parameter settings.When increasing p(no of features) while keeping other factors constant, all methods exhibit a slight decline in power, but BayesKAT consistently outperforms SKAT(Avg) and mKU.Additionally, as r(correlation coefficient between features) increases under other conditions fixed, BayesKAT also consistently surpasses SKAT(Avg) and mKU.

Figure 3 .
Figure 3. Performance comparison based on simulations with discrete features.(A) Schematic summary of the data generation process for discrete genetic features (e.g.SNP variants) and the demonstration of the implementation under various scenarios.Starting with the original Z matrix containing groups of SNPs from each of the three randomly selected pathways, a simulated SNP matrix Z is generated, preserving the underlying interrelationships among the SNPs.Covariates (i.e.age, gender and education) are incorporated based on the data from ADNI.(B) Performance comparison was conducted across different simulation settings with sample size n=755 (same as the real data).Systematic performance evaluations are plotted, i.e. the curves of empirical power versus empirical type 1 error averaged over three simulated datasets for SKAT(Avg), mKU, and BayesKAT across different scenarios.The performance curves based on each individual simulated dataset can be found in Supplementary Figure S3.

Figure 4 .
Figure 4. Functional validation of BayesKAT's prioritized genes using orthogonal information.(A) Top-ranking genes prioritized by BayesKAT are strongly supported by previous literature of functional studies of brain-related diseases.(B) The selected genes by BayesKAT demonstrate higher fractions of overlapping meQTL's CpG sites than the genes selected by SKAT(Avg) and mKU.The CpG sites of significant meQTLs from the brain tissues represent orthogonal molecular-level evidence in support of the gene's functional involvement with whole brain volume.

Figure
Figure 5. Pathway-level association tests by BayesKAT prioritizes neurodegenerative disease related pathways.(A) Schematic representation illustrating the steps of pathway-level association tests.Sets of SNPs located within genes belonging to each of the 352 pathways are tested simultaneously for pathway-level associations with the phenotype of interest (the whole brain volume), and the multiple testing correction is also implemented.(B)The topranking pathways prioritized by (i) BayesKAT, (ii) mKU, and (iii) SKAT(Avg) demonstrate distinct enrichment with neurodegenerative disease associated pathways.Top 50 pathways are shown for fair comparison.The top-ranking pathways by BayesKAT are ranked by the posterior probabilities of the pathway-level associations.The top-ranking pathways by the frequentist methods, mKU and SKAT(Avg) are ranked by the -log 10 (P-values).The pathways highlighted in red are neurodegenerative disease related pathways.The red horizontal dashed line in each bar plot indicates the threshold used by each model for fair comparison (see Materials and Methods).The pie charts illustrate the proportion of the selected pathways (above model's thresholds) that belong to the neurodegenerative disease pathways.BayesKAT notably exhibits enhanced prioritization of neurodegenerative disease pathways.Due to the issue of inflated P-values in mKU, pathways with P-values of 0 are assigned -log 10 (P-values)=30 for visualizations.

5 .
Figure 5. Pathway-level association tests by BayesKAT prioritizes neurodegenerative disease related pathways.(A) Schematic representation illustrating the steps of pathway-level association tests.Sets of SNPs located within genes belonging to each of the 352 pathways are tested simultaneously for pathway-level associations with the phenotype of interest (the whole brain volume), and the multiple testing correction is also implemented.(B)The topranking pathways prioritized by (i) BayesKAT, (ii) mKU, and (iii) SKAT(Avg) demonstrate distinct enrichment with neurodegenerative disease associated pathways.Top 50 pathways are shown for fair comparison.The top-ranking pathways by BayesKAT are ranked by the posterior probabilities of the pathway-level associations.The top-ranking pathways by the frequentist methods, mKU and SKAT(Avg) are ranked by the -log 10 (P-values).The pathways highlighted in red are neurodegenerative disease related pathways.The red horizontal dashed line in each bar plot indicates the threshold used by each model for fair comparison (see Materials and Methods).The pie charts illustrate the proportion of the selected pathways (above model's thresholds) that belong to the neurodegenerative disease pathways.BayesKAT notably exhibits enhanced prioritization of neurodegenerative disease pathways.Due to the issue of inflated P-values in mKU, pathways with P-values of 0 are assigned -log 10 (P-values)=30 for visualizations.

Figure 6 .
Figure 6.Boosted association tests based on BayesKAT's composite kernel reveals novel modules of genes and proteins linked to brain volume.(A) The inferred weights of individual kernels in BayesKAT's composite kernel (right) recapitulate the strength of each kernel (-log 10 (P-values)) when each kernel is incorporated separately (left).Without relying on prior knowledge or repetitively testing different kernels separately, BayesKAT automatically infers the appropriate composite kernels to boost the group-level tests for different pathways.(B) Prioritized co-expression gene modules by BayesKAT (left) versus.SKAT(Avg) and mKU (right).The significance threshold of selection for each model is represented by the horizontal red dashed lines (see Materials and Methods).(C) The selected significant co-expression gene modules by BayesKAT demonstrate higher fractions of overlapping with significant SNPs from orthogonal GWAS studies, compared with the results from SKAT(Avg) and mKU.(D) The selected significant PPI modules by BayesKAT demonstrate higher fractions of overlapping with significant SNPs from orthogonal GWAS studies, compared with the results from SKAT(Avg) and mKU.(E) Prioritized PPI modules by BayesKAT (left) versus SKAT(Avg) and mKU (right).The significance threshold of selection for each model is represented by the horizontal red dashed lines (see Materials and Methods).
The most common genetic features are SNPs and the widely used molecular-level features include gene expressions.The features, i.e.Z .j, j = 1, 2, • • • , p are associated with the trait, i.e. y, through an arbitrary function h(•) which is assumed to lie in a function space H K generated by a kernel function K(•, •).